Towards Distribution-Free Multi-Armed Bandits with Combinatorial Strategies
Authors
Abstract
We consider the following linear combinatorial multi-armed bandits (MABs) problem. In a discrete-time system, there are K unknown random variables (RVs), i.e., arms, each evolving as an i.i.d. stochastic process over time. At each time slot, we select a set of N (N ≤ K) RVs, i.e., a strategy, subject to an arbitrary constraint. We then gain a reward that is a linear combination of the observations on the selected RVs. Our goal is to minimize the regret, defined as the difference between the cumulative reward obtained by an optimal static policy that knows the mean of each RV and that obtained by a specified learning policy that does not. A prior result for this problem achieves zero regret (the time-averaged expected regret approaches zero as time goes to infinity), but its guarantee depends on the probability distribution of the strategies generated by the learning policy: the regret becomes arbitrarily large when the gap between the rewards of the best and second-best strategies approaches zero. Meanwhile, when the number of combinations is exponential, a naive extension of a prior distribution-free policy performs poorly in terms of regret, computation, and space complexity. We propose an efficient Distribution-Free Learning (DFL) policy that achieves zero regret without depending on the probability distribution of strategies. Our learning policy requires only O(K) time and space. When maximizing the linear combination involves an NP-hard problem, our policy provides a flexible scheme for choosing among approximation algorithms that solve the problem efficiently while retaining zero regret.
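To make the objective concrete, the regret described above can be written as follows. The notation is introduced here only for illustration and is not taken from the paper: the horizon T, the feasible strategy family \mathcal{F}, the known linear weights a_i, the observation X_i(t) of arm i at slot t with mean \mu_i, and the strategy S_t chosen by the learning policy at slot t.

R(T) \;=\; T \max_{S \in \mathcal{F},\, |S| = N} \sum_{i \in S} a_i \mu_i \;-\; \mathbb{E}\Big[ \sum_{t=1}^{T} \sum_{i \in S_t} a_i X_i(t) \Big].

Under this notation, "zero regret" in the abstract's sense means R(T)/T → 0 as T → ∞, i.e., the learning policy's per-slot reward converges to that of the optimal static strategy.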
Similar references
Semi-Bandits with Knapsacks
We unify two prominent lines of work on multi-armed bandits: bandits with knapsacks and combinatorial semi-bandits. The former concerns limited “resources” consumed by the algorithm, e.g., limited supply in dynamic pricing. The latter allows a huge number of actions but assumes combinatorial structure and additional feedback to make the problem tractable. We define a common generalization, supp...
Combinatorial Multi-armed Bandits for Real-Time Strategy Games
Games with large branching factors pose a significant challenge for game tree search algorithms. In this paper, we address this problem with a sampling strategy for Monte Carlo Tree Search (MCTS) algorithms called naïve sampling, based on a variant of the Multiarmed Bandit problem called Combinatorial Multi-armed Bandits (CMAB). We analyze the theoretical properties of several variants of naïve...
Anytime optimal algorithms in stochastic multi-armed bandits
We introduce an anytime algorithm for stochastic multi-armed bandits with optimal distribution-free and distribution-dependent bounds (for a specific family of parameters). The performance of this algorithm (as well as another one motivated by the conjectured optimal bound) is evaluated empirically. A similar analysis is provided with full information, to serve as a benchmark.
Online Multi-Armed Bandit
We introduce a novel variant of the multi-armed bandit problem, in which bandits are streamed one at a time to the player, and at each point, the player can either choose to pull the current bandit or move on to the next bandit. Once a player has moved on from a bandit, they may never visit it again, which is a crucial difference between our problem and classic multi-armed bandit problems. In t...
Schemata Bandits for Binary Encoded Combinatorial Optimisation Problems
We introduce the schema bandits algorithm to solve binary combinatorial optimisation problems, like the trap functions and NK landscape, where potential solutions are represented as bit strings. Schema bandits are influenced by two different areas in machine learning, evolutionary computation and multiarmed bandits. The schemata from the schema theorem for genetic algorithms are structured as h...